Developer Interface#
Configuration#
Certain functions of the internetarchive library require your archive.org credentials (e.g. uploading, modifying metadata, searching).
Your credentials and other configurations can be provided via a dictionary when instantiating an ArchiveSession or Item object, or in a config file.
The easiest way to create a config file is with the configure function:
>>> from internetarchive import configure
>>> configure('user@example.com', 'password')
Config files are stored in either $HOME/.ia or $HOME/.config/ia.ini by default. You can also specify your own path:
>>> from internetarchive import configure
>>> configure('user@example.com', 'password', config_file='/home/jake/.config/ia-alternate.ini')
Custom config files can be specified when instantiating an ArchiveSession object:
>>> from internetarchive import get_session
>>> s = get_session(config_file='/home/jake/.config/ia-alternate.ini')
Or an Item object:
>>> from internetarchive import get_item
>>> item = get_item('nasa', config_file='/home/jake/.config/ia-alternate.ini')
IA-S3 Configuration#
Your IA-S3 keys are required for uploading and modifying metadata. You can retrieve your IA-S3 keys at https://archive.org/account/s3.php.
They can be specified in your config file like so:
[s3]
access = mYaccEsSkEY
secret = mYs3cREtKEy
Or, using the ArchiveSession object:
>>> from internetarchive import get_session
>>> c = {'s3': {'access': 'mYaccEsSkEY', 'secret': 'mYs3cREtKEy'}}
>>> s = get_session(config=c)
>>> s.access_key
'mYaccEsSkEY'
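If you would rather not hard-code keys in source, the same dictionary can be assembled at runtime. The following is a minimal sketch, assuming the keys are exported in two environment variables; the variable names MY_IA_ACCESS and MY_IA_SECRET and the helper s3_config_from_env are hypothetical and not part of the internetarchive library.

```python
import os


def s3_config_from_env(access_var="MY_IA_ACCESS", secret_var="MY_IA_SECRET"):
    """Build the {'s3': {...}} config dictionary from environment variables.

    The variable names are hypothetical; use whatever your deployment defines.
    """
    try:
        access = os.environ[access_var]
        secret = os.environ[secret_var]
    except KeyError as missing:
        raise RuntimeError(f"environment variable {missing} is not set") from None
    return {"s3": {"access": access, "secret": secret}}


# Demonstration with placeholder values; real keys come from your account page.
os.environ["MY_IA_ACCESS"] = "mYaccEsSkEY"
os.environ["MY_IA_SECRET"] = "mYs3cREtKEy"
config = s3_config_from_env()
# The result has the same shape as the dictionary passed to get_session(config=...).
```

The returned dictionary can be passed to get_session in place of the literal shown above.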
Logging Configuration#
You can specify logging levels and the location of your log file like so:
[logging]
level = INFO
file = /tmp/ia.log
Or, using the ArchiveSession object:
>>> from internetarchive import get_session
>>> c = {'logging': {'level': 'INFO', 'file': '/tmp/ia.log'}}
>>> s = get_session(config=c)
By default logging is turned off.
Other Configuration#
By default all requests are HTTPS.
You can change this setting in your config file in the general section:
[general]
secure = False
Or, using the ArchiveSession object:
>>> from internetarchive import get_session
>>> s = get_session(config={'general': {'secure': False}})
In the example above, all requests will be made via HTTP.
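The [s3], [logging] and [general] sections can live together in one config file. As a rough sketch, the file can be generated with Python's standard configparser module; the helper write_ia_config and the temporary path are illustrative assumptions, and the configure function remains the supported way to create a config file.

```python
import configparser
import tempfile
from pathlib import Path


def write_ia_config(path, access, secret, log_level="INFO",
                    log_file="/tmp/ia.log", secure=True):
    """Write an INI file containing the [s3], [logging] and [general] sections."""
    parser = configparser.ConfigParser()
    parser["s3"] = {"access": access, "secret": secret}
    parser["logging"] = {"level": log_level, "file": log_file}
    parser["general"] = {"secure": str(secure)}
    with open(path, "w") as fh:
        parser.write(fh)
    return path


# Write to a throwaway directory so the demonstration leaves $HOME untouched.
config_path = Path(tempfile.mkdtemp()) / "ia-alternate.ini"
write_ia_config(config_path, "mYaccEsSkEY", "mYs3cREtKEy", secure=False)

# Read the file back to confirm the layout.
check = configparser.ConfigParser()
check.read(config_path)
```

The resulting path is the kind of value the config_file parameter expects.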
ArchiveSession Objects#
The ArchiveSession object is subclassed from requests.Session.
It collects together your credentials and config.
- get_session(config: Mapping | None = None, config_file: str | None = None, debug: bool = False, http_adapter_kwargs: MutableMapping | None = None) session.ArchiveSession [source]#
Return a new ArchiveSession object. The ArchiveSession object is the main interface to the internetarchive lib. It allows you to persist certain parameters across tasks.
- Parameters
config – A dictionary used to configure your session.
config_file – A path to a config file used to configure your session.
debug – To be passed on to this session’s method calls.
http_adapter_kwargs – Keyword arguments that requests.adapters.HTTPAdapter takes.
- Returns
An ArchiveSession object that persists certain parameters across tasks.
Usage:
>>> from internetarchive import get_session
>>> config = {'s3': {'access': 'foo', 'secret': 'bar'}}
>>> s = get_session(config)
>>> s.access_key
'foo'
From the session object, you can access all of the functionality of the internetarchive lib:
>>> item = s.get_item('nasa')
>>> item.download()
nasa: ddddddd - success
>>> s.get_tasks(task_ids=31643513)[0].server
'ia311234'
Item Objects#
Item objects represent Internet Archive items.
From the Item object you can create new items, upload files to existing items, read and write metadata, and download or delete files.
- get_item(identifier: str, config: Mapping | None = None, config_file: str | None = None, archive_session: session.ArchiveSession | None = None, debug: bool = False, http_adapter_kwargs: MutableMapping | None = None, request_kwargs: MutableMapping | None = None) item.Item [source]#
Get an Item object.
- Parameters
identifier – The globally unique Archive.org item identifier.
config – A dictionary used to configure your session.
config_file – A path to a config file used to configure your session.
archive_session – An ArchiveSession object can be provided via the archive_session parameter.
debug – To be passed on to get_session().
http_adapter_kwargs – Keyword arguments that requests.adapters.HTTPAdapter takes.
request_kwargs – Keyword arguments that requests.Request takes.
- Returns
The Item that fits the criteria.
- Usage:
>>> from internetarchive import get_item
>>> item = get_item('nasa')
>>> item.item_size
121084
Uploading#
Uploading to an item can be done using Item.upload():
>>> from internetarchive import get_item
>>> item = get_item('my_item')
>>> r = item.upload('/home/user/foo.txt')
Or using the upload function:
>>> from internetarchive import upload
>>> r = upload('my_item', '/home/user/foo.txt')
The item will automatically be created if it does not exist.
Refer to archive.org Identifiers for more information on creating valid archive.org identifiers.
Setting Remote Filenames#
Remote filenames can be defined using a dictionary:
>>> from io import BytesIO
>>> fh = BytesIO()
>>> fh.write(b'foo bar')
>>> item.upload({'my-remote-filename.txt': fh})
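When several in-memory payloads need remote names, the dictionary can be built in one place. The sketch below is only an illustration: the helper in_memory_files is hypothetical, and rewinding each buffer is a defensive step, since this page does not state whether the library rewinds file-like objects before reading them.

```python
from io import BytesIO


def in_memory_files(contents):
    """Map remote filenames to rewound BytesIO objects.

    `contents` maps each remote filename to its str or bytes payload.
    """
    files = {}
    for remote_name, payload in contents.items():
        if isinstance(payload, str):
            payload = payload.encode("utf-8")
        buffer = BytesIO()
        buffer.write(payload)
        buffer.seek(0)  # rewind so a reader starts at the first byte
        files[remote_name] = buffer
    return files


files = in_memory_files({
    "my-remote-filename.txt": "foo bar",
    "another-file.txt": b"second file",
})
# The mapping has the same shape as the dictionary passed to item.upload().
```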
- upload(identifier: str, files, metadata: Mapping | None = None, headers: dict | None = None, access_key: str | None = None, secret_key: str | None = None, queue_derive=None, verbose: bool = False, verify: bool = False, checksum: bool = False, delete: bool = False, retries: int | None = None, retries_sleep: int | None = None, debug: bool = False, validate_identifier: bool = False, request_kwargs: dict | None = None, **get_item_kwargs) list[requests.Request | requests.Response] [source]#
Upload files to an item. The item will be created if it does not exist.
- Parameters
identifier – The globally unique Archive.org identifier for a given item.
files – The filepaths or file-like objects to upload. This value can be an iterable or a single file-like object or string.
metadata – Metadata used to create a new item. If the item already exists, the metadata will not be updated; use modify_metadata instead.
headers – Add additional HTTP headers to the request.
access_key – IA-S3 access_key to use when making the given request.
secret_key – IA-S3 secret_key to use when making the given request.
queue_derive – Set to False to prevent an item from being derived after upload.
verbose – Display upload progress.
verify – Verify local MD5 checksum matches the MD5 checksum of the file received by IAS3.
checksum – Skip uploading files based on checksum.
delete – Delete local file after the upload has been successfully verified.
retries – Number of times to retry the given request if S3 returns a 503 SlowDown error.
retries_sleep – Amount of time to sleep between retries.
debug – Set to True to print headers to stdout, and exit without sending the upload request.
validate_identifier – Set to True to validate the identifier before uploading the file.
**get_item_kwargs – Optional arguments that get_item takes.
- Returns
A list of Requests if debug, else a list of Responses.
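The verify and checksum options both rest on comparing MD5 digests. The sketch below illustrates that idea only; it is not the library's implementation, and the helpers local_md5 and needs_upload are hypothetical names introduced here.

```python
import hashlib
import os
import tempfile


def local_md5(path, chunk_size=1024 * 1024):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def needs_upload(path, remote_md5):
    """Decide whether a local file differs from the copy already stored remotely."""
    return remote_md5 is None or local_md5(path) != remote_md5


# Demonstration with a temporary file standing in for a real upload candidate.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"foo bar")
sample_path = tmp.name
sample_md5 = local_md5(sample_path)
unchanged = needs_upload(sample_path, sample_md5)  # digests match
changed = needs_upload(sample_path, "0" * 32)      # digests differ
os.remove(sample_path)
```

Reading in chunks keeps memory use flat for large files, which matters when the files being compared are media archives.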
Metadata#
- modify_metadata(identifier: str, metadata: Mapping, target: str | None = None, append: bool = False, append_list: bool = False, priority: int = 0, access_key: str | None = None, secret_key: str | None = None, debug: bool = False, request_kwargs: Mapping | None = None, **get_item_kwargs) requests.Request | requests.Response [source]#
Modify the metadata of an existing item on Archive.org.
- Parameters
identifier – The globally unique Archive.org identifier for a given item.
metadata – Metadata used to update the item.
target – The metadata target to update. Defaults to metadata.
append – Set to True to append metadata values to current values rather than replacing. Defaults to False.
append_list – Append values to an existing multi-value metadata field. No duplicate values will be added.
priority – Set task priority.
access_key – IA-S3 access_key to use when making the given request.
secret_key – IA-S3 secret_key to use when making the given request.
debug – Set to True to return a requests.Request object instead of sending request. Defaults to False.
**get_item_kwargs – Arguments that get_item takes.
- Returns
A Request if debug, else a Response.
The default target to write to is metadata.
If you would like to write to another target, such as files, you can specify so using the target parameter.
For example, if you had an item whose identifier was my_identifier and you wanted to add a metadata field to a file within the item called foo.txt:
>>> from internetarchive import modify_metadata
>>> r = modify_metadata('my_identifier', metadata={'title': 'My File'}, target='files/foo.txt')
>>> from internetarchive import get_files
>>> f = list(get_files('my_identifier', 'foo.txt'))[0]
>>> f.title
'My File'
You can also create new targets if they don’t exist:
>>> r = modify_metadata('my_identifier', metadata={'foo': 'bar'}, target='extra_metadata')
>>> from internetarchive import get_item
>>> item = get_item('my_identifier')
>>> item.item_metadata['extra_metadata']
{'foo': 'bar'}
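File-level targets follow the files/<filename> form used in the foo.txt example. A small sketch of how the arguments for such a call might be assembled; file_target and file_metadata_kwargs are hypothetical helpers, and only the parameter names identifier, metadata and target come from the modify_metadata signature.

```python
def file_target(filename):
    """Return the metadata target string for a file inside an item."""
    if not filename or filename.startswith("/"):
        raise ValueError("filename must be a non-empty relative name")
    return f"files/{filename}"


def file_metadata_kwargs(identifier, filename, metadata):
    """Collect the keyword arguments for a file-level modify_metadata call."""
    return {
        "identifier": identifier,
        "metadata": dict(metadata),  # copy so the caller's mapping is not shared
        "target": file_target(filename),
    }


kwargs = file_metadata_kwargs("my_identifier", "foo.txt", {"title": "My File"})
# These could then be applied as modify_metadata(**kwargs).
```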
Downloading#
- download(identifier: str, files: files.File | list[files.File] | None = None, formats: str | list[str] | None = None, glob_pattern: str | None = None, dry_run: bool = False, verbose: bool = False, ignore_existing: bool = False, checksum: bool = False, destdir: str | None = None, no_directory: bool = False, retries: int | None = None, item_index: int | None = None, ignore_errors: bool = False, on_the_fly: bool = False, return_responses: bool = False, no_change_timestamp: bool = False, timeout: int | float | tuple[int, float] | None = None, **get_item_kwargs) list[requests.Request | requests.Response] [source]#
Download files from an item.
- Parameters
identifier – The globally unique Archive.org identifier for a given item.
files – Only return files matching the given file names.
formats – Only return files matching the given formats.
glob_pattern – Only return files matching the given glob pattern.
dry_run – Print URLs to files to stdout rather than downloading them.
verbose – Turn on verbose output.
ignore_existing – Skip files that already exist locally.
checksum – Skip downloading file based on checksum.
destdir – The directory to download files to.
no_directory – Download files to current working directory rather than creating an item directory.
retries – The number of times to retry on failed requests.
item_index – The index of the item for displaying progress in bulk downloads.
ignore_errors – Don’t fail if a single file fails to download, continue to download other files.
on_the_fly – Download on-the-fly files (i.e. derivative EPUB, MOBI, DAISY files).
return_responses – Rather than downloading files to disk, return a list of response objects.
**get_item_kwargs – Optional arguments that get_item takes.
- Returns
A list of Requests if debug, else a list of Responses.
Deleting#
- delete(identifier: str, files: files.File | list[files.File] | None = None, formats: str | list[str] | None = None, glob_pattern: str | None = None, cascade_delete: bool = False, access_key: str | None = None, secret_key: str | None = None, verbose: bool = False, debug: bool = False, **kwargs) list[requests.Request | requests.Response] [source]#
Delete files from an item. Note: Some system files, such as <itemname>_meta.xml, cannot be deleted.
- Parameters
identifier – The globally unique Archive.org identifier for a given item.
files – Only return files matching the given filenames.
formats – Only return files matching the given formats.
glob_pattern – Only return files matching the given glob pattern.
cascade_delete – Delete all files associated with the specified file, including upstream derivatives and the original.
access_key – IA-S3 access_key to use when making the given request.
secret_key – IA-S3 secret_key to use when making the given request.
verbose – Print actions to stdout.
debug – Set to True to print headers to stdout and exit without sending the delete request.
- Returns
A list of Requests if debug, else a list of Responses.
File Objects#
- get_files(identifier: str, files: files.File | list[files.File] | None = None, formats: str | list[str] | None = None, glob_pattern: str | None = None, exclude_pattern: str | None = None, on_the_fly: bool = False, **get_item_kwargs) list[files.File] [source]#
Get File objects from an item.
- Parameters
identifier – The globally unique Archive.org identifier for a given item.
files – Only return files matching the given filenames.
formats – Only return files matching the given formats.
glob_pattern – Only return files matching the given glob pattern.
exclude_pattern – Exclude files matching the given glob pattern.
on_the_fly – Include on-the-fly files (i.e. derivative EPUB, MOBI, DAISY files).
**get_item_kwargs – Arguments that get_item() takes.
- Returns
Files from an item.
- Usage:
>>> from internetarchive import get_files
>>> fnames = [f.name for f in get_files('nasa', glob_pattern='*xml')]
>>> print(fnames)
['nasa_reviews.xml', 'nasa_meta.xml', 'nasa_files.xml']
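To get a feel for how glob_pattern and exclude_pattern interact, the selection can be approximated locally. This is a sketch under the assumption that the patterns behave like shell-style globs as implemented by Python's fnmatch module; the library's exact matching rules are not spelled out on this page, and select_names and the sample filename list are made up for illustration.

```python
from fnmatch import fnmatchcase


def select_names(names, glob_pattern=None, exclude_pattern=None):
    """Keep names matching glob_pattern and drop those matching exclude_pattern."""
    selected = []
    for name in names:
        if glob_pattern is not None and not fnmatchcase(name, glob_pattern):
            continue
        if exclude_pattern is not None and fnmatchcase(name, exclude_pattern):
            continue
        selected.append(name)
    return selected


# Sample names for illustration; the last entry is invented.
names = ["nasa_reviews.xml", "nasa_meta.xml", "nasa_files.xml", "example_image.jpg"]
xml_names = select_names(names, glob_pattern="*xml")
no_meta = select_names(names, glob_pattern="*xml", exclude_pattern="*_meta*")
```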
Searching Items#
- search_items(query: str, fields: Iterable | None = None, sorts=None, params: Mapping | None = None, full_text_search: bool = False, dsl_fts: bool = False, archive_session: session.ArchiveSession | None = None, config: Mapping | None = None, config_file: str | None = None, http_adapter_kwargs: MutableMapping | None = None, request_kwargs: Mapping | None = None, max_retries: int | Retry | None = None) search.Search [source]#
Search for items on Archive.org.
- Parameters
query – The Archive.org search query to yield results for. Refer to https://archive.org/advancedsearch.php#raw for help formatting your query.
fields – The metadata fields to return in the search results.
params – The URL parameters to send with each request sent to the Archive.org Advancedsearch Api.
full_text_search – Beta support for querying the archive.org Full Text Search API [default: False].
dsl_fts – Beta support for querying the archive.org Full Text Search API in dsl (i.e. do not prepend "!L " to the full_text_search query) [default: False].
config – Configuration options for session.
config_file – A path to a config file used to configure your session.
http_adapter_kwargs – Keyword arguments that requests.adapters.HTTPAdapter takes.
request_kwargs – Keyword arguments that requests.Request takes.
max_retries – The number of times to retry a failed request. This can also be a urllib3.Retry object. If you need more control (e.g. status_forcelist), use an ArchiveSession object, and mount your own adapter after the session object has been initialized. For example:
>>> s = get_session()
>>> s.mount_http_adapter()
>>> search_results = s.search_items('nasa')
See ArchiveSession.mount_http_adapter() for more details.
- Returns
A Search object, yielding search results.
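Queries are plain strings, so they can be composed before being handed to search_items. The following is a minimal sketch, assuming the field:value clauses joined with AND that the advanced search page describes; build_query is a hypothetical helper and covers only this simple case.

```python
def build_query(**fields):
    """Join field:value pairs with AND, quoting values that contain spaces."""
    clauses = []
    for field, value in fields.items():
        value = str(value)
        if " " in value:
            # Wrap multi-word values in double quotes, escaping embedded quotes.
            value = '"' + value.replace('"', '\\"') + '"'
        clauses.append(f"{field}:{value}")
    return " AND ".join(clauses)


query = build_query(collection="nasa", mediatype="image")
titled = build_query(title="apollo 11")
# The string can then be used as the query argument of search_items().
```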
Internet Archive Tasks#
- get_tasks(identifier: str = '', params: dict | None = None, config: Mapping | None = None, config_file: str | None = None, archive_session: session.ArchiveSession | None = None, http_adapter_kwargs: MutableMapping | None = None, request_kwargs: MutableMapping | None = None) set[catalog.CatalogTask] [source]#
Get tasks from the Archive.org catalog.
- Parameters
identifier – The Archive.org identifier for which to retrieve tasks.
params – The URL parameters to send with each request sent to the Archive.org catalog API.
- Returns
A set of CatalogTask objects.